A morpho-graphemic approach for the recognition of spontaneous speech in agglutinative languages - like Hungarian

نویسندگان

  • Péter Mihajlik
  • Tibor Fegyó
  • Zoltán Tüske
  • Pavel Ircing
چکیده

A coupled acousticand language-modeling approach is presented for the recognition of spontaneous speech primarily in agglutinative languages. The effectiveness of the approach in large vocabulary spontaneous speech recognition is demonstrated on the Hungarian MALACH corpus. The derivation of morphs from word forms is based on a statistical morphological segmentation tool while the mapping of morphs into graphemes is obtained trivially by splitting each morph into individual letters. Using morphs instead of words in language modeling gives significant WER reductions in case of both phonemeand grapheme-based acoustic modeling. The improvements are larger after speaker adaptation of the acoustic models. In conclusion, morphophonemic and the proposed morpho-graphemic ASR approaches yield the same best WERs, which are significantly lower than the word-based baselines but essentially without language dependent rules or pronunciation dictionaries in the latter case.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Development of a Large Spontaneous Speech Database of Agglutinative Hungarian Language

In this paper, a large Hungarian spoken language database is introduced. This phonetically-based multi-purpose database contains various types of spontaneous and read speech from 333 monolingual speakers (about 50 minutes of speech sample per speaker). This study presents the background and motivation of the development of the BEA Hungarian database, describes its protocol and the transcription...

متن کامل

A Unification-based Approach to Morpho-syntactic Parsing of Agglutinative and Other (Highly) Inflectional Languages

This paper introduces a new approach to morpho-syntactic analysis through Humor 99 (High-speed Unification Mo.rphology), a reversible and unification-based morphological analyzer which has already been integrated with a variety of industrial applications. Humor 99 successfully copes with problems of agglutinative (e.g. Hungarian, Turkish, Estonian) and other (highly) inflectional languages (e.g...

متن کامل

Evaluating an Agglutinative Segmentation Model for ParaMor

This paper describes and evaluates a modification to the segmentation model used in the unsupervised morphology induction system, ParaMor. Our improved segmentation model permits multiple morpheme boundaries in a single word. To prepare ParaMor to effectively apply the new agglutinative segmentation model, two heuristics improve ParaMor’s precision. These precision-enhancing heuristics are adap...

متن کامل

Modelling of Filled Pauses and Onomatopoeias for Spontaneous Speech Recognition

With the growing availability of various content provided over state-of-the-art digital media is speech recognition becoming one of the main core technologies (Billi et al., 1997; Žgank et al., 2002; Gupta et al., 2000; Sket et al., 2002). Its task is to minimize the needed effort to access the particular part of content. The main content categories can be grouped in the following way: • broadc...

متن کامل

Prosodic Cues for Automatic Word Boundary Detection in ASR

This article presents a cross-lingual study for agglutinative, fixed stressed languages, like Hungarian and Finnish, about the segmentation of continuous speech on word level by examination of supra-segmental parameters. We have developed different algorithms based either on a rule based or a datadriven approach. The best results were obtained by data-driven algorithms (HMMbased methods) using ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007